The impact of different imputation methods on estimates and model performance: an example using a risk prediction model for premature mortality

Objective: To compare how different imputation methods affect the estimates and performance of a prediction model for premature mortality. Study Design and Setting: Sex-specific Weibull accelerated failure time survival models were run on four separate datasets using complete case, mode, single and multiple imputation to handle missing values. Six performance measures were compared to assess predictive accuracy (Nagelkerke R², Integrated Brier Score), discrimination (Harrell’s c-index, discrimination slope) and calibration (calibration in the large, calibration slope). Results: The highest proportion of missingness for a single variable was 10.86% for the female model and 8.24% for the male model. For complete case, mode, single and multiple imputation, respectively, the Nagelkerke R² values for the female model were 0.1084, 0.1116, 0.1120 and 0.1111–0.1120, and the male model exhibited similar variation with values of 0.1050, 0.1078, 0.1078 and 0.1078–0.1081. Harrell’s c-index also demonstrated small variation, with values of 0.8666, 0.8719, 0.8719 and 0.8711–0.8719 for the female model and 0.8549, 0.8548, 0.8550 and 0.8550–0.8553 for the male model. Conclusion: In the scenarios examined in this study, which used a population health survey, mode imputation performed comparably to single and multiple imputation when predictive performance is the main goal of the model. To generate unbiased hazard ratios, multiple imputation methods were superior. This study shows the need to consider the best imputation approach for predictive model development given the conditions of missing data and the goals of the analysis. Supplementary Information: The online version contains supplementary material available at 10.1186/s12963-024-00331-3.


Introduction
Missing data is an inevitable challenge in health surveys that can compromise the representativeness of the sample, introduce bias, and reduce statistical power [1]. Several factors contribute to missing data, including non-response and survey administration errors. To address this issue, imputation methods have been developed, with several techniques employed in practice [2]. The choice of imputation method depends on several factors, including the type and pattern of missing data, the assumptions about the missingness mechanism, and the specific goals of the analysis [1,2].
Prediction models are valuable tools that estimate the likelihood of future outcomes or events based on available data. These models serve diverse purposes in healthcare, clinical care and population health. Clinical risk prediction models assess individual patient risk and support treatment decisions, often relying on data from electronic patient records (e.g., blood pressure, bloodwork, genetic markers) [3]. Population risk algorithms, on the other hand, predict disease incidence, evaluate the impact of risk factors, and inform population health interventions directed at groups of people rather than at individuals [4]. The accuracy and reliability of a prediction model depend largely on the quality and representativeness of the data, which can be influenced by the presence of missing data and the methods used to address it [5]. Existing prediction model reporting guidelines, such as the transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD), recommend reporting on missingness in development and validation datasets and how missing data were addressed [6]. Despite these recommendations, the reporting and handling of missing data in prediction models is often inadequate [7–9].
Although there is existing literature on imputation methods in the context of survey data, a notable gap remains in our understanding of how missing data affect prediction models based on population surveys. Therefore, the objective of this study is to compare four common approaches to handling missing values: complete case, mode imputation, single imputation, and multiple imputation. This comparison aims to assess the effect of each technique on model estimates and on model performance.

Methods
The Premature Mortality Population Risk Tool (PreMPoRT) [10,11] was developed and validated to predict the five-year incidence of premature mortality among Canadian adults. Model predictors included sociodemographic characteristics, self-perceived measures, health behaviours, and chronic conditions from national survey data. PreMPoRT demonstrated strong reproducibility and transportability in different validation data and performed well among important equity-stratified subgroups. Additional details about the development and validation of PreMPoRT are found elsewhere [10,11].
We applied four missing data approaches: complete case, mode imputation, single imputation and multiple imputation using fully conditional specification (FCS) [12]. Six performance measures were used to assess the impact of each imputation method on the prediction model.

Data sources
PreMPoRT used data from the Canadian Community Health Survey (CCHS), a cross-sectional survey containing information on self-reported sociodemographic characteristics, health status, health care utilization, and health determinants. The surveyed population represents 98% of the Canadian population aged 12 and older [13], and the survey uses a complex design, including clustering and stratification, to represent all regions in Canada. The CCHS was linked to the Canadian Vital Statistics Database (CVSD) to ascertain premature mortality during a five-year follow-up period after the CCHS interview date [14]. Data were held at the Statistics Canada Research Data Centre.

Participants
The study cohort consisted of participants who contributed to any of the first six cycles, 1.

Model specification
PreMPoRT predicts premature mortality, which the Canadian Institute for Health Information (CIHI) defines as any death before the age of 75 [15]. Using death dates from the CVSD, the outcome is all-cause mortality within five years after the CCHS interview date or before the participant's 75th birthday. PreMPoRT was developed using sex-specific Weibull accelerated failure time models. Participants were followed until five years after the interview date, death, or age 75, whichever came first.
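The follow-up rule described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function names `add_years` and `follow_up` are hypothetical, and the handling of boundary cases (for example, a death falling exactly on the 75th birthday being treated as censored) is an assumption not stated in the text.

```python
from datetime import date
from typing import Optional, Tuple


def add_years(d: date, years: int) -> date:
    """Shift a date by whole years (Feb 29 falls back to Feb 28 if needed)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def follow_up(interview: date, birth: date, death: Optional[date]) -> Tuple[int, int]:
    """Return (days of follow-up, event indicator).

    Follow-up ends at death, five years after interview, or the 75th
    birthday, whichever comes first. Only deaths inside that window
    count as events (assumed boundary handling).
    """
    end_of_window = add_years(interview, 5)
    birthday_75 = add_years(birth, 75)
    censor_date = min(end_of_window, birthday_75)
    if death is not None and death <= end_of_window and death < birthday_75:
        return (death - interview).days, 1
    return (censor_date - interview).days, 0
```

A participant interviewed at age 73 who dies at age 76 is therefore censored at the 75th birthday rather than counted as a premature death.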
Using 38 candidate predictors [10], PreMPoRT identified 12 predictors for the female model and 13 predictors for the male model. Both models contained age, household income quintile, education level, self-perceived general health, cigarette smoking, emphysema/COPD, heart disease, diabetes, cancer, and stroke. Body mass index (BMI) and physical activity were unique to the female model, whereas marital status, Alzheimer's disease, and arthritis were unique to the male model.
To accurately represent the Canadian population, CCHS survey weights were developed by Statistics Canada to handle the complex-survey design and to represent certain demographic groups properly [16]. Since multiple cycles were used in the analysis, CCHS survey weights were pooled and divided by the number of cycles [17].
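As a small illustration of the weight adjustment described above, the sketch below pools the weights from several cycles and divides each by the number of cycles, so that the combined sample represents one population rather than one population per cycle. The function name `pool_weights` and the dictionary layout are hypothetical; the actual CCHS weight files and any further adjustments are not shown in the text.

```python
from typing import Dict, List


def pool_weights(weights_by_cycle: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Divide every survey weight by the number of cycles being pooled."""
    n_cycles = len(weights_by_cycle)
    return {
        cycle: [w / n_cycles for w in weights]
        for cycle, weights in weights_by_cycle.items()
    }


# Hypothetical weights for two cycles (labels are placeholders).
example = {"cycle_a": [300.0, 600.0], "cycle_b": [900.0]}
pooled = pool_weights(example)
```

After pooling, the weights sum to the average of the per-cycle totals instead of their sum.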

Imputing missing values: four approaches
We used four different methods to handle missing values. The first was complete case, in which any participant with one or more missing predictors was removed from the analysis. The second was mode imputation, in which, within each sex-stratified CCHS cycle, a missing value for a predictor was replaced with the most common value of that predictor.
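The first two approaches are simple enough to sketch directly. The example below is an illustration only, assuming records stored as dictionaries with `None` marking a missing value; the function names and the column names `sex` and `cycle` are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def complete_case(rows: List[dict], predictors: Sequence[str]) -> List[dict]:
    """Keep only participants with no missing predictor."""
    return [r for r in rows if all(r[p] is not None for p in predictors)]


def mode_impute(rows: List[dict], predictors: Sequence[str],
                strata: Sequence[str] = ("sex", "cycle")) -> List[dict]:
    """Fill each missing value with the most common observed value
    of that predictor within the same sex-by-cycle stratum."""
    groups: Dict[tuple, List[dict]] = defaultdict(list)
    for r in rows:
        groups[tuple(r[s] for s in strata)].append(r)

    # Most common observed value per (stratum, predictor).
    modes = {}
    for key, group in groups.items():
        for p in predictors:
            observed = [r[p] for r in group if r[p] is not None]
            if observed:
                modes[(key, p)] = Counter(observed).most_common(1)[0][0]

    imputed = []
    for r in rows:
        key = tuple(r[s] for s in strata)
        filled = dict(r)  # copy so the input rows are not modified
        for p in predictors:
            if filled[p] is None and (key, p) in modes:
                filled[p] = modes[(key, p)]
        imputed.append(filled)
    return imputed
```

Complete case shrinks the dataset, whereas mode imputation keeps every participant but concentrates imputed values in the most common category.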
The third method was single imputation using FCS [12]. Although PreMPoRT identified 12 predictors for females and 13 for males, imputation was run using all 38 candidate variables and the outcome [18]. Imputation was run separately for each cycle and stratified by sex. However, due to convergence issues, all chronic conditions that had a low prevalence and less than 1% missingness were imputed as the absence of the condition (i.e., mode imputation for variables with less than 1% missing). These chronic conditions included emphysema/COPD, heart disease, diabetes, cancer, stroke, Alzheimer's disease and arthritis. FCS was then run with five burn-in iterations to allow the imputed values to converge before the imputed dataset was created. FCS used a different regression model for each variable type: logistic regression for binary variables, the discriminant function for nominal variables, and ordinal logistic regression for ordinal variables with more than two categories. Each variable was imputed within each CCHS cycle, with the exception of anxiety and mood disorder in the first cycle, as these questions were not asked in that cycle. After all other variables had been imputed within each cycle, anxiety and mood disorder for the first cycle were imputed using the next two CCHS cycles, as these were the other cycles within the development dataset. When building a prediction model, it is important to avoid leakage between development and validation sets. Imputing within each CCHS cycle, and imputing anxiety and mood disorder using only the development cycles, avoids leakage from imputation.
The final method was multiple imputation (MI), which applied the same approach as single imputation to create four additional datasets, for a total of five. The goal of MI is to generate multiple imputed datasets to observe how the distribution of the imputed values affects the results of the model.

Model performance and measures
To compare the effects of the imputation methods on the prediction model, the Weibull-specific model parameters, hazard ratios (HRs), and performance measures were compared. The Weibull model parameters include the scale and shape parameters as well as the intercept. The hazard ratios compare the proportional increase in the rate of premature mortality versus the reference group and were calculated for each predictor in the model. Finally, six performance measures were compared to assess the model's overall predictive accuracy, discrimination, and calibration.
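For readers relating the Weibull parameters to the hazard ratios, the standard conversion can be written out as follows. This is a sketch that assumes the usual log-linear accelerated failure time parameterization, with $\sigma$ the scale parameter and $1/\sigma$ the shape; the text does not state which parameterization was used, and the symbols $\beta_0$, $\beta$, $\sigma$ and $\varepsilon$ are introduced here for illustration.

```latex
\begin{align}
\log T &= \beta_0 + x^\top \beta + \sigma \varepsilon,
  \qquad \varepsilon \sim \text{standard extreme value (minimum)} \\
S(t \mid x) &= \exp\!\left[-\left(\frac{t}{\exp(\beta_0 + x^\top \beta)}\right)^{1/\sigma}\right] \\
h(t \mid x) &= \frac{1}{\sigma}\, t^{1/\sigma - 1}
  \exp\!\left(-\frac{\beta_0 + x^\top \beta}{\sigma}\right) \\
\mathrm{HR}_j &= \exp\!\left(-\frac{\beta_j}{\sigma}\right)
\end{align}
```

Under this form, the hazard ratio for predictor $j$ depends on both its coefficient and the scale, so a shift in the estimated scale changes every hazard ratio derived from the same coefficients.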
The Nagelkerke R² and Integrated Brier Score were used to assess predictive accuracy. The Nagelkerke R² measures the percent of variance explained by the model, with a target value of one. The Integrated Brier Score measures the average squared difference between the outcome and the predicted risk (while taking censoring into account), with a target value of zero.
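The Brier score definition above can be made concrete with a short formula. The version below is a simplified sketch that ignores censoring for clarity; censoring-adjusted versions reweight each term by the inverse probability of remaining uncensored. The symbols $T_i$ (time of death), $\hat{p}_i(t)$ (predicted risk of death by time $t$) and $\tau$ (end of the evaluation window, for example the five-year horizon) are notation assumed here, not taken from the text.

```latex
\begin{align}
\mathrm{BS}(t) &= \frac{1}{n} \sum_{i=1}^{n}
  \left( \mathbf{1}\{T_i \le t\} - \hat{p}_i(t) \right)^2 \\
\mathrm{IBS} &= \frac{1}{\tau} \int_{0}^{\tau} \mathrm{BS}(t)\, dt
\end{align}
```

A model that predicts every outcome perfectly at every time point has an integrated score of zero.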
Discrimination is how well the model can differentiate between those who experience an outcome and those who do not. This was assessed using Harrell's concordance index (c-index), which is the number of concordant pairs divided by the number of concordant and discordant pairs [18]. A pair compares two participants in the study: if the individual who had an event first had a higher predicted risk (concordant pair), the model properly predicted the outcome. However, if that individual had a lower predicted risk (discordant pair), the model did not properly predict the outcome. Discrimination was also assessed using the time-specific discrimination slope, which is the difference between the average predicted risk of those who had an event and the average predicted risk of those who did not.
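Both discrimination measures can be illustrated with a short sketch. It follows the definitions given above (concordant pairs over concordant plus discordant pairs, so pairs with tied predicted risks are ignored) and makes simplifying assumptions that are not from the text: pairs with tied times are skipped, survey weights are not applied, and the discrimination slope treats every participant without an event as a non-event rather than adjusting for censoring. Function names are hypothetical.

```python
from typing import Sequence


def harrell_c(times: Sequence[float], events: Sequence[int],
              risks: Sequence[float]) -> float:
    """Concordant pairs / (concordant + discordant pairs).

    A pair is usable when the participant with the shorter follow-up
    time had an event; it is concordant when that participant also
    had the higher predicted risk.
    """
    concordant = discordant = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue
        for j in range(n):
            if times[i] < times[j]:
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] < risks[j]:
                    discordant += 1
    return concordant / (concordant + discordant)


def discrimination_slope(events: Sequence[int], risks: Sequence[float]) -> float:
    """Mean predicted risk among events minus mean among non-events."""
    with_event = [r for e, r in zip(events, risks) if e == 1]
    without_event = [r for e, r in zip(events, risks) if e == 0]
    return sum(with_event) / len(with_event) - sum(without_event) / len(without_event)
```

The pairwise loop is quadratic in the number of participants, so it is suitable for illustration but not for a cohort of several hundred thousand people.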
Finally, calibration was measured using calibration in the large and the calibration slope. Calibration in the large is the difference between the average observed risk (normally calculated using Kaplan-Meier curves) and the average predicted risk. The calibration slope assesses whether the betas are well calibrated for the model. A slope of one indicates perfect calibration, less than one indicates the betas are overestimating the predicted risk, and more than one indicates the betas are underestimating the predicted risk. In addition, calibration plots were produced to show further the effect of imputation methods on the calibration of the prediction model.
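Calibration in the large, as defined above, can be sketched with a small Kaplan-Meier routine. This is an unweighted illustration under stated assumptions: survey weights are not applied, the sign convention (observed minus predicted) is a choice made here, and the function names are hypothetical.

```python
from typing import Sequence


def kaplan_meier_risk(times: Sequence[float], events: Sequence[int],
                      horizon: float) -> float:
    """Observed risk by the horizon: 1 - S(horizon) from the
    Kaplan-Meier product-limit estimator."""
    survival = 1.0
    event_times = sorted({t for t, e in zip(times, events) if e == 1 and t <= horizon})
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1 - deaths / at_risk
    return 1 - survival


def calibration_in_the_large(times: Sequence[float], events: Sequence[int],
                             predicted: Sequence[float], horizon: float) -> float:
    """Average observed risk minus average predicted risk."""
    observed = kaplan_meier_risk(times, events, horizon)
    return observed - sum(predicted) / len(predicted)
```

A value near zero means the model's average predicted risk matches the observed risk; it says nothing about whether risk is well calibrated within subgroups, which is what the slope and the plots address.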

Results
The highest proportion of missingness in any one variable was 10.86% for the female model and 8.24% for the male model. All chronic conditions, marital status, self-perceived general health and physical activity had less than 1% missingness. BMI, smoking status and individual education all had between 1% and 5% missingness, and household income quintile was the only variable with more than 5% missingness.

Baseline characteristics
Table 1 shows the weighted percent of baseline characteristics with unweighted total counts from the cohort rounded to the nearest thousand to adhere to Statistics Canada's export requirements. All datasets have a total of 267,000 for females and 233,000 for males, except for the complete case, which had a total of 221,000 (17% removed) for females and 195,000 (16% removed) for males. Across all imputations, a total of 1.41% of females and 2.06% of males experienced premature death, except for complete case (1.27% premature deaths for females and 1.93% premature deaths for males). There were no notable differences across imputation methods, apart from household income quintiles, which had a missingness of 10.86% for females and 8.24% for males. The lowest income quintile for females had a missingness of 15.11% for complete case and 15.50–21.14% for the remaining imputation methods. For males, the biggest difference was in the highest income quintile, with 29.24% missingness for complete case and 28.48–31.92% for other imputation methods.

Performance measures
Table 2 shows the variation in performance measures when applying the four imputation methods. The Nagelkerke R² for the female model was 0.1084 for complete case, with the remaining imputation methods ranging between 0.1111 and 0.1120. The Nagelkerke R² for the male model was 0.1050 for complete case, with a range of 0.1078–0.1081 for the other imputation methods. The c-index for complete case was 0.8666 for females and 0.8549 for males, with the remaining methods giving ranges of 0.8711–0.8719 and 0.8548–0.8553 for females and males, respectively.
The performance measures for calibration changed minimally across imputation methods. In addition to the performance measures, Figs. 1 and 2 show the average observed risk of premature mortality against the predicted risk of the model for females and males, respectively. Predicted risk is shown in deciles, and the percent of observed cases that had a premature death in each decile is reported. Perfect calibration corresponds to a slope of 1. The supplementary materials contain additional calibration plots for select predictors, including age groups, education level, ethnicity, immigration status and material deprivation. These show the percentage of premature deaths and compare them to the average predicted risk from each imputation method.
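The data behind a decile calibration plot of this kind can be sketched as follows. This is an unweighted illustration that uses the raw proportion of deaths in each decile as the observed risk and ignores censoring and survey weights; the function name `calibration_by_decile` is hypothetical.

```python
from typing import List, Sequence, Tuple


def calibration_by_decile(predicted: Sequence[float], died: Sequence[int],
                          n_groups: int = 10) -> List[Tuple[float, float]]:
    """Return (mean predicted risk, observed proportion of deaths)
    for each group of participants ordered by predicted risk."""
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    n = len(order)
    rows = []
    for g in range(n_groups):
        idx = order[g * n // n_groups:(g + 1) * n // n_groups]
        if not idx:
            continue  # fewer participants than groups
        mean_pred = sum(predicted[i] for i in idx) / len(idx)
        observed = sum(died[i] for i in idx) / len(idx)
        rows.append((mean_pred, observed))
    return rows
```

Plotting the second value against the first for each decile gives the calibration curve; points on the diagonal indicate agreement between predicted and observed risk.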

Hazard ratios and confidence intervals
Table 3 shows the Weibull parameters and the HRs for the female and male models by imputation method. The female scale parameter was 0.7852 for complete case and varied from 0.8194 to 0.8200 for the remaining imputation methods. The male scale parameter was 0.8137 for complete case and ranged from 0.8468 to 0.8472 for the other imputation methods. The HRs for all chronic conditions, age, self-perceived general health, cigarette smoking, physical activity, and marital status remained relatively unchanged between the imputation methods, with the exception of complete case, which showed noticeable differences across almost all predictors. Excluding complete case, household income demonstrated the biggest difference in confidence intervals for the female model. Specifically, the confidence interval for the lowest income quintile (Q1) was 1.22–1.26 for mode imputation and 1.09–1.33 for multiple imputation. For the second highest quintile (Q4), it was 1.11–1.15 for mode imputation and 0.95–1.16 for multiple imputation. We observed similar variation for the male model, with the interval for the lowest income quintile (Q1) being 1.36–1.40 for mode imputation and 1.31–1.54 for multiple imputation, and the interval for the second highest quintile (Q4) being 1.12–1.15 for mode imputation and 1.10–1.19 for multiple imputation.

Discussion
Although there are other imputation methods involving machine learning, this study aimed to investigate the effects of four missing data techniques on model coefficients and performance using a linked health survey. Our findings suggest that complete case analysis is not suitable for handling missing data when developing a prediction model. Interestingly, performance measures exhibited minimal changes across mode, single and multiple imputation. However, multiple imputation proved essential in obtaining accurate HRs and confidence intervals for predictors with a higher degree of missingness.

Complete case
Although complete case analysis is a commonly used technique for handling missing data, it is known to produce biased estimates and large standard errors when the data are not missing completely at random (MCAR) [2]. In our study, the prevalence of premature deaths was reduced despite only a relatively small amount of the cohort being removed. This observation suggests that individuals who experienced a premature death were more likely to have missing information, indicating a failure of the MCAR assumption. This bias is also particularly evident in the Weibull scale parameter. While mode, single, and multiple imputation demonstrated only minor variations in the scale values, complete case exhibited noticeable differences.
Given that the scale parameter directly impacts the baseline survival, even slight changes can result in differences in predicted probabilities. The Nagelkerke R², c-index and calibration in the large all indicated poorer performance in the models using complete case analysis compared to mode, single, and multiple imputation, for both the female and male models. These results strongly suggest that complete case analysis is an inadequate method and should be avoided [2,19].

Table 1 footnotes:
1 Multiple imputation created five datasets, so the values shown are the range from the minimum percent to the maximum percent across those datasets
2 Physical activity was measured using average metabolic equivalent of task (MET) per day derived from a list of leisure-time physical activities (frequency and duration of activity)

Comparing performance measures
The results demonstrate similar performance when comparing mode, single, and multiple imputation techniques, with only marginal differences observed. This suggests that for risk prediction, single and multiple imputation offer minimal to no discernible benefit to model performance compared to mode imputation. Furthermore, when examining the calibration plots, all approaches tend to overpredict premature mortality in the higher-risk groups. This is due to less than 2% of the population having a five-year premature mortality risk greater than 20%. However, the variations between the different imputation methods are relatively minor, suggesting that the choice of imputation method has limited impact on the calibration of the models.

Comparing hazard ratios and confidence intervals
When comparing the imputation methods, the differences in HRs and confidence intervals are heavily influenced by the percent of missingness in each variable.
Variables with less than 1% missingness, such as marital status, self-perceived general health, physical activity, and chronic conditions, show minimal changes in HRs. Multiple imputation, however, tends to yield slightly wider confidence intervals due to the inclusion of additional variance from the HRs across the five imputed datasets. Predictors with a higher degree of missingness, but still below 5%, demonstrate larger changes in HRs and wider confidence intervals when employing multiple imputation. These predictors include individual education level, BMI, and smoking status. Household income surpassed 5% missingness and exhibits notable differences in the female model. Multiple imputation showed that the confidence intervals were underestimated in mode and single imputation. While all income quintiles, except the lowest income group (Q1), were found to be statistically significant in mode and single imputation, they were no longer statistically significant when using multiple imputation. For males, household income remained nearly unchanged between mode, single, and multiple imputation, apart from wider confidence intervals. Consequently, for variables with higher levels of missingness, it is difficult to predict whether effects will differ across imputed datasets or remain consistent.
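The widening of confidence intervals under multiple imputation can be illustrated with the standard pooling rules (Rubin's rules), in which the total variance is the average within-dataset variance plus an inflated between-dataset variance. This is a sketch under assumptions: the text does not state which pooling method was used, the normal-approximation multiplier of 1.96 replaces the usual t-based reference distribution, and the function name `pool_rubin` and all input values are hypothetical.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def pool_rubin(log_hrs: Sequence[float],
               std_errors: Sequence[float]) -> Tuple[float, float, float]:
    """Pool log hazard ratios from m imputed datasets.

    Returns (pooled HR, lower 95% limit, upper 95% limit).
    """
    m = len(log_hrs)
    pooled = mean(log_hrs)
    within = mean([se ** 2 for se in std_errors])  # average sampling variance
    between = variance(log_hrs)                    # variance across imputations
    total = within + (1 + 1 / m) * between
    se = math.sqrt(total)
    return (math.exp(pooled),
            math.exp(pooled - 1.96 * se),
            math.exp(pooled + 1.96 * se))


# Hypothetical estimates from five imputed datasets.
stable = pool_rubin([0.20] * 5, [0.05] * 5)
varying = pool_rubin([0.10, 0.20, 0.30, 0.20, 0.20], [0.05] * 5)
```

With identical estimates in every dataset the interval matches that of a single dataset; when estimates vary across imputations, as they do for predictors with more missingness, the pooled interval widens even though the point estimate is unchanged.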

Limitations
This study should be interpreted considering the following limitations. First, individuals residing in the territories had to be removed because area-based measures and household income were completely missing. Second, due to convergence issues with multiple imputation, all chronic conditions with low percent missingness were assigned the absence of the given condition (the most common occurrence in the data), and thus the effects of the different imputation methods could not be properly tested for these predictors. Third, the highest missingness of a single variable was less than 11%, and thus we could not compare the methods for variables with larger missingness. When encountering data with higher levels of missingness, it is important to note that the results here may not apply.

Table 2 footnotes:
1 The Nagelkerke R² measures the percent of variance explained by the model with a target value of one
2 The Integrated Brier Score measures the average squared difference between the outcome and the predicted risk (while taking censoring into account) with a target value of zero
3 Harrell's concordance index (c-index) is the fraction of the number of concordant pairs over the number of concordant pairs and discordant pairs
4 Time-specific discrimination slope is the difference in the average predicted risk of those who had an event and those who did not have an outcome
5 Calibration in the large is the difference between the average observed risk (normally calculated using Kaplan-Meier curves) and the average predicted risk
6 The calibration slope assesses if the betas are well-calibrated for the model (with a slope of one indicating perfect calibration)

Fig. 2 Calibration plot of predicted risk deciles versus average observed risk of premature mortality for males

Conclusions
When dealing with missing data in population-based studies, the choice of imputation method depends on the specific goals of the analysis. Researchers should consider the trade-offs between simplicity and accuracy when selecting the appropriate imputation method for their analysis. Both single imputation and multiple imputation are complex methods, requiring more time and methodological knowledge to properly impute missing data. As such, when working with population-based data with similar missingness, if the researcher is solely interested in the overall performance of the model and not the individual effects of the predictors, mode imputation is an option. However, if an accurate estimation of predictor effects is of interest, the selection of the imputation method should consider the percentage of missingness in the variables. When predictors have a small percentage of missing values (less than 5%), mode imputation is satisfactory. Once predictors have a higher percentage of missingness (5% or more), imputed values will introduce greater variability. In such cases, multiple imputation becomes essential to capture the effect of the imputed values accurately.

Fig. 1 Calibration plot of predicted risk deciles versus average observed premature mortality for females

The study received ethics approval from the University of Toronto Research Ethics Board (Protocol #37499). This work was supported by the Canadian Institutes of Health Research Operating Grants (FRNs: 72056684 and 72051628). Laura Rosella is also supported by a Canada Research Chair in Population Health Analytics (FRN: 72060091).
Individuals were removed from the cohort if they were pregnant or living in the Territories (Nunavut, Northwest Territories and Yukon) at the time of their CCHS interview date. Since PreMPoRT was developed for the Canadian adult population, individuals under 18 years old or over 75 were excluded.

Table 1
Baseline characteristics

Table 2
Performance measures

Table 3
Hazard ratios
